Operate on GCS data using PySpark

The code below reads a text file from GCS, runs a simple word count on it, and writes the result back to a GCS output path.


In [ ]:
# GCS paths for the input file and the output location
input_file = "gs://my/GCS/input/file.read"
output_file = "gs://my/GCS/output/file.write"

In [ ]:
# Read the input file from GCS into an RDD of lines
lines = sc.textFile(input_file)
# Split each line into individual words
words = lines.flatMap(lambda line: line.split())
# Count occurrences of each word, then write the (word, count) pairs back to GCS
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
wordCounts.saveAsTextFile(output_file)
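
To check the result, you can read the output back into an RDD and inspect a few records. This is a minimal sketch, assuming the notebook's existing SparkContext sc and the output_file path defined above; note that saveAsTextFile writes a directory of part files and will fail if that path already exists.

In [ ]:
# Read back the part files Spark wrote to the output path and show a sample
saved = sc.textFile(output_file)
saved.take(10)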